Collection of Internet

Return-Path: <connolly@pixel.convex.com>
Received: from dxmint.cern.ch by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0)
    id AA25970; Fri, 8 Jan 93 00:59:21 MET
Received: by dxmint.cern.ch (5.65/DEC-Ultrix/4.3)
    id AA11465; Fri, 8 Jan 1993 01:14:18 +0100
Received: from pixel.convex.com by convex.convex.com (5.64/1.35)
    id AA29874; Thu, 7 Jan 93 18:13:48 -0600
Received: from localhost by pixel.convex.com (5.64/1.28)
    id AA06638; Thu, 7 Jan 93 18:13:46 -0600
Message-Id: <9301080013.AA06638@pixel.convex.com>
To: "Thomas A. Fine" <fine@cis.ohio-state.edu>
Cc: www-talk@nxoc01.cern.ch
Subject: Re: dealing with new-lines
In-Reply-To: Your message of "Thu, 07 Jan 93 16:27:50 EST."
    <9301072127.AA07870@soccer.cis.ohio-state.edu>
Date: Thu, 07 Jan 93 18:13:46 CST
From: Dan Connolly <connolly@pixel.convex.com>

>How should browsers deal with new-lines, and where can html-generators
>put in new-lines?

Darn good question. Your approach appears to have the correct results,
but I'm not sure it's practical for many implementations (global
search-and-replace operations are inconvenient for sequential
processing models), and it certainly isn't a healthy way to think
about SGML documents.

The way to think about SGML documents, IMHO, is this: the sequence
of characters comprising an SGML document is presented to an SGML
parser, which parses the markup from the data and passes the "results"
to the processing application.

[Much of this is covered in
http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/921203/Text.html
and what isn't there should be, but bear with me...]

So many of the details of the syntax of SGML are invisible to the
processing application: the fact that <>'s delimit tags instead of
{}'s, for example. The information delivered to the application by
the parser is called the Element Structure Information Set. It contains
things like tags, attributes, attribute values, and data characters.

So there are two questions, to my mind:

    1. How does the SGML parser treat newlines?
    2. How does the WWW processing application treat newlines?

Question 1 is answered by the SGML standard. Question 2 is for us
to decide.

SGML defines several types of content, which determine the kinds of
markup that are recognized inside an element. The simplest is EMPTY,
for example:

    <!ELEMENT P - O EMPTY>

When you see a P start tag, you know there is no content, and you
assume that a P end tag follows, effectively.

The next simplest is CDATA, for example:

    <!ELEMENT TITLE - - CDATA>

When you parse the content of a TITLE element, the only thing you
look for is an end tag. Everything else is reported by the SGML parser
as data characters.

Then there is RCDATA, which is just like CDATA, except that character
and entity references are also recognized.

The most common content type is MIXED, where all kinds of markup are
recognized: tags, entities, as well as data. For example:

    <!ELEMENT ADDRESS - - (#PCDATA|A|P)>

The parser should report start tags, end tags, entities, and data
inside an ADDRESS element.

Then there is ELEMENT content, where only tags are recognized,
for example:

    <!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID?)>

The parser will report any data inside a HEAD element as an error.
But [and this is the reason I went through this whole exercise]
whitespace is ignored between tags in element content. So the text:

    <head>
    <title>sample</title>
    </head>

will be reported to the application by the parser as a HEAD start tag,
a TITLE start tag, the data string "sample", a TITLE close tag, and
a HEAD close tag, whereas the text:

    <address>
    <a HREF="#tim">Tim Berners-Lee</a>
    </address>

will be reported as an ADDRESS start tag, the data string "\n",
an A start tag (with an HREF attribute and value), the data string
"Tim Berners-Lee", an A close tag, the data string "\n", and an
ADDRESS close tag.

[There's another content type called ANY, but it's just like MIXED
for our purposes.]

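The difference between element content and mixed content can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not sgmls or libHTML: the names TOKEN,
ELEMENT_CONTENT, and events are made up here, the table of content
types is hard-coded rather than read from a DTD, and the sketch ignores
tag omission, entity references, SGML's finer record-end rules, and
attribute values that contain ">".

```python
import re

# Assumption for illustration: elements declared with element content
# (only tags allowed inside). Everything else is treated as mixed content.
ELEMENT_CONTENT = {"HEAD"}

# A start or end tag, or a run of data characters up to the next "<".
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

def events(text):
    """Return (kind, value) pairs roughly as a parser might report them."""
    stack = []   # names of the currently open elements
    out = []
    for m in TOKEN.finditer(text):
        closing, name, data = m.group(1), m.group(2), m.group(3)
        if data is not None:
            in_element_content = bool(stack) and stack[-1] in ELEMENT_CONTENT
            if in_element_content and data.strip() == "":
                continue  # whitespace between tags in element content is dropped
            # A real parser would flag non-blank data in element content
            # as an error; this sketch just passes it through.
            out.append(("data", data))
        elif closing:
            if stack:
                stack.pop()
            out.append(("end", name.upper()))
        else:
            stack.append(name.upper())
            out.append(("start", name.upper()))
    return out

head = events("<head>\n<title>sample</title>\n</head>")
address = events('<address>\n<a HREF="#tim">Tim Berners-Lee</a>\n</address>')
```

With these inputs, head contains no "\n" data events at all, while
address keeps a "\n" data event after the ADDRESS start tag and another
before the ADDRESS close tag.
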
>I spent quite a bit of time thinking about what is intuitively the right
>way to do it, and I came up with this method.
>
>0. Convert all new-lines inside of tags to spaces.

Newlines inside tags are the responsibility of the SGML parser. I
suggest you use the excellent sgmls parser to test your rules by
trial and error, or consult the standard. I have done both, and the
results of my labors are available in libHTML. I have also done
an elisp implementation, if anybody's interested.

The one tricky case is newlines inside attribute value literals, e.g.

    <foo bar="12
    3">

SGML section 7.9.3 says:

    "An attribute value literal is interpreted as an attribute value
    by replacing references within it, ignoring Ee and RS, and
    replacing RE or SEPCHAR with SPACE."

The reference concrete syntax assigns the conventional unix newline
character, ASCII code 10, to the role of RS. So strictly speaking,
it should be ignored, and the value of the attribute is "123".

On the other hand, the sgmls parser does a little behind-the-scenes
magic on newlines. From the sgmls man page:

    An external entity resides in one or more files. The entity
    manager component of sgmls maps a sequence of files into an
    entity in three sequential stages:

    1. each carriage return character is turned into a non-SGML
       character;

    2. each newline character is turned into a record end
       character, and at the same time a record start character
       is inserted at the beginning of each line;

    3. the files are concatenated.

[This sort of thing _does_ still conform to the SGML standard. You're
allowed to do magic while assembling entities.]

So using sgmls, the newline in this case would be treated as RE,
and converted to SPACE, i.e. ASCII character 32, by the parser.
So the value of the bar attribute is "12 3".

It's a question of how we construct SGML entities from HTML data
streams.

>1. For each tag NOT in
>       <PRE> </PRE> <A> </A> <PLAINTEXT>
>   remove ALL surrounding new-lines.

First, let's get one thing straight: the PLAINTEXT element as
described by the original HTML documentation is not representable
in SGML. For my purposes, I consider the HTML document to end at
the <PLAINTEXT> tag, and I consider the rest of the data stream
to be an RFC-822 message body or a MIME text/plain body, and not
SGML at all.

Next, let's keep in mind that you can't do things like the
following global substitution,

    s/\n+(<(H1|H2|ADDRESS...))>/$2/g;

because it might find things that look like tags but aren't, for
example

    <foo bar="
    <H1>this is a little kooky, but nonetheless legal and possible.">

But even if you're using a proper SGML parser, consider:

    <H1>Here we go!
    <a href="#xyz">click here</a>
    There we went!
    </H1>

The parser will return an H1 start tag, and then the string
"Here we go!\n". At this point, your rule doesn't tell me what to
do with the newline. I have to get the next object before I decide.

Hmm... I guess that's reasonable. But I'd rather just pass all the
data characters on to the text formatter and let it figure all
this out.

Do we want to specify rules for the text formatter? If so, we need
to go beyond just newlines. I see some data providers writing
things like:

    <H1>Here are some things to consider:</H1>
    <p>
        thing one
    <p>
        thing two
    <p>
        thing three

The MidasWWW browser displays this as

    Here are some things to consider:
    thing one
    thing two
    thing three

which I think is reasonable. The provider should either use

    <H1>Here are some things to consider</H1>
    <UL>
    <li>thing one
    <li>thing two
    <li>thing three
    </ul>

or at a minimum,

    <H1>Here are some things to consider</H1>
    <PRE>
        thing one
        thing two
        thing three
    </PRE>

>2. For each tag in
>       <PRE> <PLAINTEXT>
>   remove ALL new-lines to left, and one new-line to the right.

Why remove one new-line to the right? Just for HTML source file
aesthetics?

>If XMP and LISTING sections are being used, they would be treated the
>same as PRE.
>
>Note that this converts new-lines around anchors into spaces UNLESS they
>appear immediately at the beginning or end of some other element.

There are also some new elements that act like A: EM, CODE, SAMP, etc.

>If browsers use this method, it would allow html-generators to put in
>new-lines all over the place for readability of HTML, without introducing
>lots of annoying extra spaces in the output. This is what seems like
>the most useful thing to do, although I'm not sure it is "correct".
>
>So is it correct? And are there any obvious flaws?

We have not specified the rules for typesetting elements other than
XMP, LISTING, and PRE before now, so what you suggest is as correct
as anything else.

I think it's important that we agree on how to typeset the <PRE>
element. [And I think getting rid of the first newline after a <PRE>
tag is a Bad Thing.]

It's not important to me that we agree on how to typeset other
elements. I'm inclined to give formatters great leeway with how they
treat whitespace. I wouldn't mind at all if something like

    <H1>testing 123</H1>
    foo bar blech icky<a>wicky</a> woo
    <p>
    abc defghi jhkl sldjkf sld lsdjkf

were typeset as:

    TESTING 123

    foo bar blech ickywicky woo
    abc defghi jhkl sldjkf sld lsdjkf

My point is: don't use whitespace to represent significant information
except in the PRE element. Use the tags that are defined to have
significance.

Dan